All coding steps will be done using Python. If you need help setting up your machine, please refer to this link
Before you start, make sure to import and load all the necessary packages:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from matplotlib import pyplot as plt
from plotnine import *
from great_tables import GT, md
from mizani.formatters import percent_format
from scipy.stats import chi2
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, roc_curve, roc_auc_score
Evaluating Performance: Statistical Tests
In our previous discussions, we saw that \(R^2\), which was our measure of overall goodness-of-fit for linear regression models, really does not convey any relevant information for models like Logit
Furthermore, the usual t-tests, used to test hypotheses around \(\hat{\beta}\), are not applicable here
Logistic regression assumes errors follow the logistic distribution
Consequently, the term \(\dfrac{(\hat{\beta}-\beta_0)}{se(\hat{\beta})}\) does not follow a t-distribution
How can we test hypotheses around Logit models and assess overall accuracy?
We’ll make use of the fact that Logit models are estimated using a likelihood function, \(\mathcal{L}\), in order to derive some important evaluation metrics
Hands-on Exercise
Description
As you progress through your work on the bank CRM dataset, there are a couple of questions that are still outstanding:
How do Logit estimates translate to economic interpretation?
How can we assess the performance of this classifier model and compare that to other models?
Are there ways to test whether a specific variable improves the model's performance?
We will continue to use the same dataset as before - you can download the bank-dataset.csv data using the Download button below
Note: this dataset is primarily based on this Kaggle notebook, although some adaptations have been made for teaching purposes
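To follow along with the code snippets in this section, a minimal setup along the following lines should suffice. The file name matches the download above, but the column names (churn, age) and the choice to use every remaining column as a covariate are assumptions for illustration - adapt them to the actual file:

# Load the data and fit the baseline Logit model reused throughout this section
data = pd.read_csv('bank-dataset.csv')
y = data['churn']                         # binary churn indicator (assumed column name)
X = data.drop('churn', axis=1)            # covariates; encode any non-numeric columns first
reg = sm.Logit(y, X.astype(float)).fit()  # `reg`, `X`, and `y` are reused below
print(reg.summary())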
A model like Logit is estimated using a maximum likelihood method. The likelihood of churning for a given individual \(i\), with observations \((y_i,x_i)\), can be written as:
\[
\mathcal{L}_i(\beta) = \Lambda(x_i^\top\beta)^{y_i}\,\big[1-\Lambda(x_i^\top\beta)\big]^{1-y_i}
\]
where \(\Lambda(\cdot)\) is the logistic function
Because we are assuming that all churn observations (i.e., customer decisions) are i.i.d., the likelihood of the entire sample is just the product of the individual likelihoods:
\[
\mathcal{L}(\beta) = \prod_{i=1}^{n}\Lambda(x_i^\top\beta)^{y_i}\,\big[1-\Lambda(x_i^\top\beta)\big]^{1-y_i}
\]
In words, a maximum-likelihood estimator is trying to find the coefficients \(\beta\) that make the sample of churn observations \((y_1,y_2,...,y_n)=Y\) most likely to occur!
What happens on the back-end is that a maximum likelihood estimator will try to find the set of parameters \(\beta\) that maximize the log-likelihood of the model, noting that the \(\log\) is a monotonic function
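For reference, taking logs of the sample likelihood above gives the log-likelihood that the estimator maximizes:
\[
\ell(\beta) = \sum_{i=1}^{n} \Big[ y_i \log \Lambda(x_i^\top\beta) + (1-y_i)\log\big(1-\Lambda(x_i^\top\beta)\big) \Big]
\]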
At the end of the day, when we want to compare models, we can think about which model has the highest log-likelihood
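In statsmodels, the maximized log-likelihood is stored on the fitted results object. Assuming the fitted model reg from the setup sketch above, a quick comparison against the intercept-only (null) model looks like:

# Log-likelihood of the fitted model vs. the intercept-only (null) model
print("Log-likelihood (fitted model):", round(reg.llf, 1))
print("Log-likelihood (null model):  ", round(reg.llnull, 1))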
There are three common tests that can be used to test this type of question: the Likelihood ratio (LR) test, the Wald test, and the Lagrange multiplier test (sometimes called a score test)
These tests are sometimes described as tests for differences among nested models, because one of the models can be said to be nested within the other:
The null hypothesis for all three tests is that the “smaller”, or “restricted” model, is the “true” model
On the other hand, a large test statistic indicates that the null hypothesis is false
On the one hand, values of \(\beta\) that are closer to the “true” relationship between the matrix \(\Lambda(X)\) of covariates and \(Y\) increase the log-likelihood
On the other hand, values of \(\beta\) that are unlikely to describe the sample relationship between \(\Lambda(X)\) and \(Y\) decrease the log-likelihood
A graphical intuition of the maximum likelihood estimator
Since \(X\) and \(Y\) are given, the only way to change \(\ell(\beta)\) is to change the estimates of the coefficients until we find the vector \(\beta^\star\) that maximizes the probability (likelihood) of describing the sample
For the purposes of our analysis, a natural constrained case to use as a baseline is the one where \(Y\) has no predictors (i.e., all \(\beta\)’s are equal to zero)
The Wald Test
Definition
The Wald Test focuses on whether a set of coefficients \(\hat{\beta}\) is collectively significantly different from a given vector \(\beta_0\):
\[
\mathcal{W} = (\hat{\beta}-\beta_0)^\top \big[\widehat{\operatorname{Var}}(\hat{\beta})\big]^{-1} (\hat{\beta}-\beta_0)
\]
The Wald test can be used to test a single hypothesis on multiple parameters, as well as to test jointly multiple hypotheses on single/multiple parameters:
The Wald test helps determine if a set of independent variables are collectively significant in a model or if individual variables add value to the model
Asymptotically, \(\mathcal{W}\) follows a \(\chi^2(Q)\) distribution, with \(Q\) being the number of restrictions
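As a sketch of how this looks in statsmodels - assuming the fitted model reg from the setup above and that age is one of its covariates - a single restriction can be tested with wald_test:

# Wald test of H0: the coefficient on 'age' equals zero
print(reg.wald_test('age = 0'))

Multiple restrictions can be passed as a comma-separated string of constraints, and the resulting statistic is then compared against a \(\chi^2(Q)\) distribution with \(Q\) restrictions.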
The Lagrange Multiplier Test
Definition
The Lagrange Multiplier test evaluates whether including an additional parameter significantly improves the model. It is based on the score function, \(S(\beta)\), the first derivative of the log-likelihood, and the test statistic is defined as:
\[
LM = S(\hat{\beta}_0)^\top I(\hat{\beta}_0)^{-1} S(\hat{\beta}_0)
\] where \(I(\hat{\beta}_0)\) is the Fisher information matrix - the negative of the expected Hessian of the log-likelihood - evaluated at the restricted estimates \(\hat{\beta}_0\)
In a sense, the Lagrange Multiplier test analyzes the slope of the log-likelihood function (its first derivative) at the restricted estimates:
Large values of the test statistic indicate that the slope vector departs significantly from 0 - there are still improvements to be made in order to maximize the log-likelihood
Asymptotically, the test statistic follows a \(\chi^2(Q)\) distribution, with \(Q\) being the number of restrictions
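One way to compute the statistic by hand is to evaluate the score and Hessian of the unrestricted model at the restricted estimates. The sketch below reuses the X and y objects from the setup above and tests whether age (an assumed column) adds to the model, i.e., a single restriction:

# Score (LM) test of H0: the coefficient on 'age' is zero
unrestricted = sm.Logit(y, X.astype(float))
restricted = sm.Logit(y, X.drop('age', axis=1).astype(float)).fit()

# Restricted parameter vector, with 0 in the position of the excluded 'age'
beta_0 = pd.Series(0.0, index=X.columns)
beta_0[restricted.params.index] = restricted.params

score = unrestricted.score(beta_0.values)           # S(beta_0)
fisher_info = -unrestricted.hessian(beta_0.values)  # I(beta_0), observed information
lm_stat = score @ np.linalg.inv(fisher_info) @ score
print("LM statistic:", round(lm_stat, 2))
print("P-value:", round(chi2.sf(lm_stat, df=1), 4))  # one restriction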
Likelihood-Ratio Test
Definition
The Likelihood Ratio test compares the likelihoods of the restricted and unrestricted models:
\[
LR = 2 \big[\ell(\hat{\beta}) - \ell(\hat{\beta}_0)\big]
\]
It measures the gain in log-likelihood from using the unrestricted model rather than the restricted one
Asymptotically, the test statistic follows a \(\chi^2(Q)\) distribution, with \(Q\) being the number of restrictions
# Perform likelihood ratio test (LRT)
lr_test = reg.llr
df = reg.df_model
p_value = reg.llr_pvalue

print("Likelihood Ratio Test:")
print("Chi-square:", lr_test.round(0))
print("Degrees of freedom:", df)
print("P-value:", p_value.round(2))
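The built-in llr above compares the fitted model against the intercept-only model. To compare two specific nested models - say, with and without age, again an assumed column - the same statistic can be computed by hand:

# Manual LR test: restricted model (without 'age') vs. unrestricted model
reg_restricted = sm.Logit(y, X.drop('age', axis=1).astype(float)).fit()
reg_unrestricted = sm.Logit(y, X.astype(float)).fit()

lr_stat = 2 * (reg_unrestricted.llf - reg_restricted.llf)
p_value = chi2.sf(lr_stat, df=1)  # one restriction
print("LR statistic:", round(lr_stat, 2), "| P-value:", round(p_value, 4))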
McFadden’s pseudo-\(R^2\) is an alternative metric for assessing a model’s performance that also makes use of the log-likelihood function:
\[
R^2_{\text{McFadden}} = 1 - \dfrac{\ell(\hat{\beta})}{\ell(\hat{\beta}_0)}
\]
where \(\ell(\hat{\beta}_0)\) is the log-likelihood of the intercept-only (null) model
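In statsmodels, this is reported directly as prsquared on the fitted results; computing it by hand from the two log-likelihoods gives the same number (again assuming the reg object from the setup above):

# McFadden's pseudo R-squared: 1 - llf / llnull
print("Pseudo R-squared (reported):", round(reg.prsquared, 3))
print("Pseudo R-squared (by hand): ", round(1 - reg.llf / reg.llnull, 3))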
How do the estimated results compare to customers’ actual choices?
On the one hand, Logit predicted values relate to estimated probabilities
On the other hand, actual information on churn is binary
From a practical perspective, one needs to map the estimated probabilities onto a categorization:
\[
\hat{Y}=
\begin{cases}
1 \text{, if } p>p^\star\\
0 \text{, if } p\leq p^\star
\end{cases}
\]
But how do we pick \(p^\star\)?
Introducing the Confusion Matrix
A way to assess the choice of \(p^\star\) is to analyze the confusion matrix:
It shows how many predictions were correct in each category: true positives and true negatives
It also shows how many predictions were incorrect in each category: false positives and false negatives
If we agnostically set \(p^\star=0.2037\), which is the sample average of churn, our example would yield the following terms for us:
The number of actual churned customers that were ex-ante classified as churned
The number of actual churned customers that were ex-ante classified as not churned
The number of actual not churned customers that were ex-ante classified as churned
The number of actual not churned customers that were ex-ante classified as not churned
Confusion Matrix
Diagonal cells indicate the True Positive and True Negative cases
Off-Diagonal cells indicate the False Positive and False Negative cases
Type I Errors (\(2,477\)) are the False Positive cases: we wrongly classified customers that did not churn as churned!
Type II Errors (\(650\)) are the False Negative cases: we wrongly classified customers that did churn as not churned!
What should a good estimator look like? Looking at Accuracy
Overall, a good estimator should minimize the combination of Type I and Type II errors. In other words, we want our estimator to have the highest possible Accuracy:
\[
\text{Accuracy} = \dfrac{TP+TN}{TP+TN+FP+FN}
\]
However, there might be cases where the costs attributed to Type I and Type II errors are substantially different!
For example, sharing a discount coupon to retain a customer who was predicted to churn (Type I) but who, in reality, would not have churned even without the coupon has a relatively small cost
On the other hand, losing a customer because we did not identify that they could churn (Type II) has a significant cost to the business
Sensitivity
The first question that will likely pop up during internal conversations is: how many of the churned customers were we able to correctly identify?
This question is of special interest in churn analysis, as the goal is to target these customers before they actually make their final decision!
The Sensitivity (or True Positive Rate) calculates the proportion of correctly identified churned customers by comparing the number of true positives with the total number of actual positives:
\[
\text{Sensitivity} = \dfrac{TP}{TP+FN}
\]
Overall, it seems that our model does a decent job, identifying \(68\%\) of the actual churned customers ahead of time!
Precision
The most naive way of identifying all churned customers is to set \(p^\star=0\). In other words, if we classify all customers as churned, then, for sure, we’ll get all churned customers right!
However, we would then be wrongly classifying many customers that would not churn in the future (false positives). In practice, if we were to give coupons to every potential churner flagged this way, this action would cost us much more, as we would be wasting money on customers that wouldn’t churn!
The Precision calculates how precise our churn classification was by comparing the number of true positives with the total number of predicted positives:
\[
\text{Precision} = \dfrac{TP}{TP+FP}
\]
Although we hit a high number of churned customers, we have wrongly classified \(1-35\%=65\%\) of customers as potential churners!
Specificity
The analog of the first question relates to how many of the non-churned customers we were able to correctly identify
Knowing how many non-churned customers our model correctly predicts sheds light on how well we understand the customers that do not churn!
The Specificity (or True Negative Rate) calculates the proportion of correctly identified non-churned customers by comparing the number of true negatives with the total number of actual negatives:
\[
\text{Specificity} = \dfrac{TN}{TN+FP}
\]
The last piece that is still left to analyze is the precision of our estimates for non-churned classifications
In other words, out of all the customers predicted not to churn, how many were actually false negatives?
The Negative Predictive Value calculates the proportion of correctly identified non-churned customers by comparing the number of true negatives with the total number of predicted negatives:
\[
\text{NPV} = \dfrac{TN}{TN+FN}
\]
# Predict churn using the trained logistic regression model
# Classification threshold: the sample average churn rate, p* = y.mean()
predicted_churn = (reg.predict() >= y.mean()).astype(int)

# Estimate the Confusion Matrix
CM = confusion_matrix(y, predicted_churn)

# Chart
ConfusionMatrixDisplay.from_predictions(y, predicted_churn)

# Title
plt.title('Confusion Matrix')

# Show
plt.show()
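From the confusion matrix computed above, the metrics discussed in this section follow directly. Note that scikit-learn orders the matrix as [[TN, FP], [FN, TP]]:

# Unpack the confusion matrix and compute the classification metrics
TN, FP, FN, TP = CM.ravel()

accuracy    = (TP + TN) / (TP + TN + FP + FN)
sensitivity = TP / (TP + FN)   # True Positive Rate
precision   = TP / (TP + FP)
specificity = TN / (TN + FP)   # True Negative Rate
npv         = TN / (TN + FN)   # Negative Predictive Value

print(f"Accuracy: {accuracy:.1%} | Sensitivity: {sensitivity:.1%} | Precision: {precision:.1%}")
print(f"Specificity: {specificity:.1%} | Negative Predictive Value: {npv:.1%}")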
Putting all together
As we decrease our threshold, \(p^\star\), we reduce the Type II error (i.e., we identify more of the churned customers), since we capture more of the true positives
At the same time, however, we are increasing the Type I error, since there is going to be a higher number of false positives!
A Receiver Operating Characteristic curve (or simply ROC curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters:
The True Positive Rate (or TPR)
The False Positive Rate (or FPR)
The Area Under the Curve (AUC) provides an aggregate measure of performance across all possible classification thresholds
However, such a metric is not very informative whenever the costs of Type I and Type II errors are very different
# Define independent variables for Reg_1 (without 'age')
X1 = X.drop('age', axis=1)

# Define independent variables for Reg_2 (with 'age')
X2 = X

# Dependent variable
y = data['churn']

# Fit logistic regression models
reg_1 = sm.Logit(y, X1.astype(float)).fit()
reg_2 = sm.Logit(y, X2.astype(float)).fit()

# Predict probabilities for both models
probs_1 = reg_1.predict()
probs_2 = reg_2.predict()

# Compute ROC curves
fpr_1, tpr_1, _ = roc_curve(y, probs_1)
fpr_2, tpr_2, _ = roc_curve(y, probs_2)

# Compute AUC scores
auc_1 = roc_auc_score(y, probs_1)
auc_2 = roc_auc_score(y, probs_2)

# Combine ROC data for plotting
Data = pd.concat(
    [pd.DataFrame({'Model': 'Without Age', 'FPR': fpr_1, 'TPR': tpr_1}),
     pd.DataFrame({'Model': 'With Age', 'FPR': fpr_2, 'TPR': tpr_2})],
    axis=0
)

# Plot ROC curves
Plot = (
    ggplot(Data, aes(x='FPR', y='TPR', fill='Model'))
    + geom_point(stroke=0)
    + theme_minimal()
    + annotate(geom='text', x=0.5, y=0.5,
               label='Model 1 (without Age): ' + str(round(auc_1, 2)))
    + annotate(geom='text', x=0.4, y=0.9,
               label='Model 2 (with Age): ' + str(round(auc_2, 2)))
    + labs(x='FPR (1 - Specificity)',
           y='TPR (Sensitivity)',
           title='Model with Age outperforms the other for any threshold!')
    + theme(plot_title=element_text(size=12, face='bold'),
            axis_text=element_text(size=10),
            axis_title=element_text(size=12),
            legend_position='bottom',
            figure_size=(15, 6))
)
Plot.show()
Bridging Econometrics with Machine Learning
Potential topics that you may want to dive in when looking at binary choice models:
Compare different models in terms of predictive power, test statistics, etc., such as Probit, Random Forests, and Support Vector Machines
Fine-tune the metrics to optimize your classification results based on the question that you’re aiming to solve
Using train/test splits and balanced samples
Using cross-validation folds
Comparing the classification performance across different model specifications (for R, refer to the caret package; for Python, refer to scikit-learn) - see the sketch below
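For the last two bullets, a minimal scikit-learn sketch could look as follows. The estimator, split size, and scoring metric here are illustrative assumptions, not part of the original exercise:

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

# Hold out a test set, then cross-validate a simple classifier on the training data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
clf = LogisticRegression(max_iter=1000)
cv_auc = cross_val_score(clf, X_train, y_train, cv=5, scoring='roc_auc')
print("Cross-validated AUC (5 folds):", cv_auc.round(3))

# Evaluate on the held-out test set
clf.fit(X_train, y_train)
test_auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print("Test-set AUC:", round(test_auc, 3))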
All in all, there is a growing literature on the role of machine learning methods for supervised learning applied to classification contexts